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Abstract 

Super-resolution methods form high-resolution images from low-resolution images. In this paper, 
we develop a new Bayesian nonparametric model for super-resolution. Our method uses a beta-Bernoulli 
process to learn a set of recurring visual patterns, called dictionary elements, from the data. Because it 
is nonparametric, the number of elements found is also determined from the data. We test the results on 
both benchmark and natural images, comparing with several other models from the research literature. 
We perform large-scale human evaluation experiments to assess the visual quality of the results. In a 
first implementation, we use Gibbs sampling to approximate the posterior. However, this algorithm is 
not feasible for large-scale data. To circumvent this, we then develop an online variational Bayes (VB) 
algorithm. This algorithm finds high quality dictionaries in a fraction of the time needed by the Gibbs 



September 25, 2012 



DRAFT 



2 



sampler. 

Index Terms 

Bayesian nonparametrics, factor analysis, dictionary learning, variational inference, gibbs sampling, 
stochastic optimization, image super-resolution. 

I. Introduction 

The sparse representation of signals with a basis is important in many applications. It has been 
extensively used in image denoising, inpainting, super-resolution, classification and compressive 
sensing ffl, 0, 0, Q, 0, 0, 0, 0. 

Many real data sets can be sparsely represented in some basis; typically this basis itself has to 
be learned from the data 0, £9, 0, 0, 0, 0, 0, DDI, GH, 113. For example, an image 
can be represented by weighted combinations of recurrent patterns of pixels. This construction 
may be beneficial, both while building a model for more accurate representation of the data (e.g. 
superior image denoising models) and while deriving and implementing an inference procedure 
for more efficient algorithms. 

In this paper we consider image super-resolution (SR), the problem of recovering a high- 
resolution (HR) image from a low-resolution (LR) image. It has many applications, e.g., to 
smart phones, surveillance cameras, medical imaging, and satellite imaging. 

There are a variety of approaches for image super-resolution. In general, rendering an HR 
image from an LR image has many possible solutions. We must use regularization of some 
form, i.e., prior information about the HR, to guarantee uniqueness and stability of the extension. 
For this purpose, researchers have proposed several methods [13 j, lH4ll . Interpolation-based 
methods, such as the Bicubic method and Bilinear method, often over-smooth images, losing 
detail. Example-based approaches use machine learning to avoid this 031 . lfT6l . IH71 : they train 
on ground-truth HR and LR images, learning a statistical relationship between the two. These 
relationships are later used to reconstruct unknown HR images from corresponding LR images. 
Freeman et al. ( 10310 proposes a method that stores a training set of preprocessed patches 
and uses a nearest-neighbor search to super-resolve. Kim et al. ( ifTSlh proposes using kernel 
ridge regression with a regularized gradient descent. Another class of SR algorithms use texture 
similarity to match image regions with known textures iHSll . lfT9l . Finally, there are methods for 
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single-image super-resolution. One classic example is [20] which uses recurring patterns at same 
and different scales in a single image. 

In this work, our focus is on SR via example-based sparse coding. ScSR (Super-resolution via 
Sparse Representation) is such an algorithm pioneered in (l2TTl . This algorithm is based on sparse 
coding via LI regularized optimization. In ll2T1l . image data are represented using a collection 
of dictionary elements (recurring patterns of pixels) that are weighted across different positions. 
Although very powerful, this model requires one to specify the number of dictionary elements 
and the variance of the noise model in advance — parameters that may be difficult to assess 
for real-world images. It also only provides a batch learning algorithm, i.e., computing model 
parameters via a gradient descent algorithm on a fixed small subset of the data. 

Bayesian nonparametric methods circumvent all these limitations. These methods adapt the 
structure of the latent space to the data and provide a powerful representation because they 
infer parameters that otherwise have to be assigned a priori (23, E3, EH, [ESI, G6), Il27l . 
GEL W{, GQl EEL 03. The full posterior distribution can be approximated via MCMC or 
variational inference, yielding sparse representations and learned dictionaries. 

Bayesian nonparametric methods have been used in many image analysis applications: to learn 
deep architectures used for object recognition in Il22l . for image inpainting and denoising in Il28lh 
||29L for image segmentation in POL ll3TTl . and to learn nonparametric multiscale representations 
of images in Il32ll . 

In this paper, we develop a Bayesian nonparametric method for super-resolution. We show 
that inference in our model is feasible, performing super-resolution with both a sampling based 
algorithm and an online variational inference algorithm. In the latter, we approximate the posterior 
distributions via a stochastic gradient descent over a variational objective that enables us to use 
the full data set and process the data segment by segment. We also provide human evaluation 
experiments which shows that signal-to-noise ratio (a typical quantitative measure of success 
in image analysis applications) is not necessarily consistent with human judgement. We devise 
a new model, new algorithms, and study a human-based evaluation. We make the following 
contributions: 

• We develop a sparse Bayesian nonparametric model for SR, learning the number of dictio- 
nary elements and the noise variance from the data. 

• We develop an online variational Bayes (VB) algorithm finding high quality "coupled 
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Fig. 1. Depicting the observations extracted (e.g. image patches) from high and low resolution images. 



dictionaries" in a fraction of the time needed by traditional inference. 
• We devise large scale human evaluation experiments to explicitly assess the visual quality 
of results. 

Our approach to SR gives a rich nonparametric representation with scalable learning. 

The remainder of the paper is organized as follows: Section [II] describes the proposed super- 
resolution model and non-parametric prior, Section [HI] contains the derivation of the posterior 



inference algorithms, Section [TV] presents the experimental results and implementation details, 
Section |V] includes the discussion and future work. 



II. Proposed Approach 

Bayesian factor analysis can be used to learn factors / dictionaries from natural images. Zhou 
et al. ( 112810 used beta process factor analysis in image denoising, inpainting and compressive 
sensing. These models learn both the dictionary elements and their number from the data. 

We build here a nonparametric factor analysis model that couples an HR image to a corre- 
sponding LR image. In training, we learn the HR/LR relationship from observed HR/LR pairs. 
To perform super-resolution, we condition on an observed LR image and compute the conditional 
expectation of its corresponding HR image. A more detailed description of the training process is 
as follows. We create training data by taking observed HR images and forming corresponding LR 
images. Figure [T] depicts the preprocessing and data extraction steps. We first down-sample the 
HR images. Then, we up-sample those by interpolating with a deterministic weighting function 
(e.g. bicubic interpolation). We extract same-sized patches from the same locations of both the 
HR and interpolated LR images, and consider those patches as coupled to each other. These are 
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the data on which we train the model. 

In the model, each small patch is generated from latent global dictionary elements — small 
images functioning as factor loadings — using local sparse weights and Gaussian noise. We will 
first explain how these latent variables are generated and then present how they are used to 
generate the observations. 

We learn two dictionaries: one for high resolution images and one for low resolution images. 
In terms of notation, represents the LR dictionary element, and dj^ is the HR dictionary 
element. P® and P^ represent the dimensionality of the low and high resolution dictionary 
elements, respectively. To model each dictionary element, we use a zero-mean Gaussian distri- 
bution, 

d<° ~ ATfrpV-'lpn) df } ~ AT^P^Ipw). 

The matrix form of the dictionaries are and where kth columns of those matrices are 
and d^, respectively. 

Following ll2T1l . we assume that the sparse weights are shared by both resolution levels for 
combining dictionary elements to produce images. This is the key property of the model that 
allows us to frame super-resolution as inference. Sparse weights have two components: real 
valued weights s ik and binary valued assignments z ik . To model the weights s ik , we use a zero- 
mean Gaussian distribution with precision a. z$ is a binary vector that encodes which dictionary 
elements are activated for the corresponding observation. p(z) represents the prior of this variable 
and we will elaborate on this in next section. These are given as 

Sik ~ A/*(0, 1/a) z ik ~ p(z ik ). 

We place Gamma priors on the precisions of the sparse weights and observation noise (a and 
7). The two resolution levels share these variables as well, 

7 ~ Gamma(c, d), - Gamma(e,/). 

Let x-^ and x 2 - represents patches extracted from HR and LR images, respectively, as shown 
in Figure [TJ Given the (global) dictionary elements and (local) sparse weights, the observations 
are modeled as 

ef ~M(0 n -% w ) cS h) ~JV(0,7 _1 Ipw) 
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where {(/), (h)} represents LR and HR, respectively. Here, TV is the total number of patches, and 
represents the element- wise multiplication of two vectors. Figure [T] illustrates the graphical 
model. 

To use this model in SR, we must be able to compute the posterior distributions of the hidden 
variables. In the training phase, we must compute the posterior distributions p(D^ , | {x-^ , x-^ }) 
of the dictionaries, given a collection of HR/LR image pairs. In testing, we use their posterior 
expectation to reconstruct a held-out HR image from an LR image, 

E[xf|xf,{xf',xf}].D" ,) (s i G%) (1) 

where is the mean of the posterior distribution p(D^\{x[ h \xf^}) and (Sj 0%) are the 
posterior expectation of the sparse weights from the LR image patches (x^) via posterior 



inference. (We discuss algorithms for posterior inference in Section III ) 



A. Beta-Bernoulli Process Prior (BP) 

We now discuss the prior for the factor assignments z^. We use a beta-Bernoulli process (BP) 
Ga,[l2!,[& prior on infinite binary matrices which is connected 

to the Indian buffet process (IBP). Each row encodes which dictionary elements are activated 
for the corresponding observation; columns with at least one active cell correspond to factors. 
The distinguishing characteristic of this prior is that the number of these factors is not specified 
a priori. Conditioned on the data, we examine the posterior distribution of the binary matrix to 
obtain a data-dependent distribution of how many components are needed. 

The IBP metaphor gives the intuition. Consider a buffet of dishes at a restaurant. Suppose there 
are infinite number of dishes and we are trying to specify the infinite binary matrix indicating 
which customers (observations) choose which dishes (factors/dictionary elements). In the Indian 
buffet process (IBP), TV customers enter the restaurant sequentially. Each customer chooses 
dishes in a line from a buffet. The first customer starts from the beginning of the buffet and 
takes from each dish, stopping after Poisson(r) number of dishes. The ith customer starts from 
the beginning as well, but decides to take from dishes in proportion to their popularity within the 
previous i — 1 customers. This proportionality can be quantified as ^ where m k is the number 
of previous customers who took this kth dish. After considering the dishes previously taken by 
other customers, the ith customer tries a Poisson(?) number of new dishes. Which customers 
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Fig. 2. Graphical Model. 

chose which dishes is recorded by the infinite binary matrix with TV rows (indicating the 
customers/observations) and infinite columns (indicating the dishes/factors/dictionary elements). 
One important (and surprising) property of this process is that the joint probability of final 
assignment is independent of the order of customers getting into the restaurant which is called 
exchangeability property of the prior [33] . 

The probabilistic construction is as follows. Each observation i is drawn from a Bernoulli 
process (a sequence of independent identically distributed Bernoulli trials), ~ BeP(£?) where 
B is drawn from a beta process B ~ BP(c ,5 ). B represents the base measure with B = 
Af(0, 1//3I). As K — > oc, the zth observation is x* = Y!k^i z ^d k where z ik denotes whether 
the dictionary element d k is used while representing the ith observation or not, and the sample 
from the beta process is given by B = 2a°=i n k^d k ' Here, n k represents the usage probability of 
dictionary element d k . 

In inference, we use a finite beta-Bernoulli approximation G5ll . The finite model truncates the 
number of dictionary elements to K and is given by 

n k - Beta(c r? , Q)(l ~ z ik - Bernoulli (7v k ) 

where c and t] are scalars and k e 1, . . . K. As K tends to infinity, the finite beta-Bernoulli 
approximation approaches the IBP/BP. If the truncation is large enough, data analyzed with this 
prior will exhibit fewer than K components Il23ll . 
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B. Super-resolution via Posterior Distributions 

Our algorithm has 2 stages: fitting the model on pairs of HR and LR images, and super- 
resolving new LR images to create HR versions. 

Training: Coupled Dictionary Learning Stage. In training, we observe x-^ and xf\ All 
other random variables are latent. The key inference problem to be solved is the computation of 
the posterior distributions of the hidden variables. In the training phase, we must compute the 
posterior distributions p(D^,D^|{x^\x®}) of the dictionaries given a collection of HR/LR 
image pairs. We rewrite the coupled model in a form similar to the single scale model: 

*-(?}*-($)•*-(?) 

where the superscript (c) corresponds to combination of (/) and (h). Writing the fully-observed 
model in this way reveals that we can train the dictionaries with similar methods as for the 
single-scale base model (Training amounts to approximating the posteriors of these values). The 
differences are that we use combined patches and combined dictionaries This leads 
to shared sparse weights for the two resolution levels. (The details of how we compute the 
distribution p(D (h \ D w |{xf } , xf } }) are discussed in Section EEl) 



Super-resolving a Low Resolution Image. With fitted dictionaries in hand, we now show 
how to form HR images from LR images via posterior computation. 

In this prediction setting, the HR image x-^ is unknown; the goal is to reconstruct it from the 
LR image patches xf\ the posterior estimates of the dictionaries (D^,D^), and the precisions 
7, a of the noise and the sparse weights, 

First we find estimates of the sparse factor scores, (§* ©z*), by using the LR image patches 
and posterior estimates of the dictionaries and precisions 7 and a. The fitted value of a 
determines the strength of a "regularization term" that controls the sparsity of the factor scores. 



More precisely, this prediction setting has 3 steps. The input is a set of held-out LR image 
patches xf\ the posterior estimates of the dictionaries (D^\ D^), and the precisions 7, a of the 
noise and the sparse weights. The steps are as follows: 
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1) We find estimates of the sparse factor scores, (s$ © z*), conditioned on the LR image 
patches xf* and estimates (D^,D^),7, a from the training stage. 

2) Eq. [I] determines the HR patches x-^. 

3) We replace each by its corresponding collocated x\ h \ the whole HR image, X^, is the 
pixel- wise average of those overlapping reconstructions. 

Post-processing: Following ll2T1l . we apply a post-processing step that, when down-sampled, 
the reconstructed HR image, X h \ should match the given LR image X®. Specifically, we solve 
the following optimization: 

X W * = argmin||/(X) -X«||| + C ||/(X) -X {h) \\l 

X 

where /() is a linear operator consisting of an anti-aliasing filter followed by down-sampling. 
This optimization problem is solved with gradient descent. 

III. Posterior Inference 

In the proposed approach, all of the priors are in the conjugate exponential family. In a first 
implementation, we use Gibbs sampling. We iteratively sample from the conditional distribution 
of each hidden variable given the others and the observations. This defines a Markov chain 
whose stationary distribution is the posterior ll34l . The corresponding sampling equations are 
analytic and provided in the appendix A-B (appendix is in the supplementary material). 

The Gibbs sampler has difficulty with scaling to large data, because it must go through many 
iterations, each time visiting the entire data set before the sampler mixes. For this reason, both 
our Gibbs sampler and ScSR use 10 5 patches sampled from 3 x 10 6 . We now develop here 
an alternative algorithm to Gibbs sampling for SR that scales to large and streaming data. 
Specifically, we develop an online variational inference algorithm. 

Variational inference is a deterministic alternative to MCMC that replaces sampling with 
optimization. The idea is to posit a parameterized family of distribution over the hidden variables 
and then optimize the parameters to minimize the KL divergence to the posterior of interest |[35l . 
Our algorithm iteratively tracks an approximate posterior distribution, which improves as more 
data are seen. 

In typical applications, the variational objective is optimized with coordinate ascent, iteratively 
optimizing each parameter while holding the others fixed. However, in Bayesian settings, this 
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suffers from the same problem as Gibbs sampling — the entire data set must be swept through 
multiple times in order to find a good approximate posterior. In the algorithm we present here, 
we replace coordinate ascent optimization with stochastic optimization — at each iteration, we 
subsample our data and then adjust the parameters according to a noisy estimate of the gradient. 
Because we only to subsample the data at each iteration, rather than analyze the whole data 
set, the resulting algorithm scales well to large data. This technique was pioneered in 06l 
and was recently exploited for online learning of topic models (37] and hierarchical Dirichlet 
processes ll38ll . 

We first develop the coordinate ascent algorithm for the coupled model. Then we derive the 
online variational inference algorithm, which can more easily handle large data sets. 



A. Variational Inference for the Coupled model 



We use the coupling perspective in Section |II-B| to derive the batch variational Bayes (VB) 
algorithm. The single-scale base model is the BPFA model of [26], which gives a mean-field 
variational inference algorithm. The batch VB algorithm derived here is the coupled version of 
that. 

We first define a parametrized family of distributions over the hidden variables. Let Q = 
{7T, Z, S, D, 7, a) denote the hidden variables for alH, k. We write coupled data as in Equation 
^ in the new set-up the variables to be learned become Q = {7r, Z, S, D^ c \ 7, a}. We use a fully 
factorized variational distribution, 

q(Q) = q T (7r)q 4> (D^)q u (Z)q e (S)q x ( 1 )q e (a). 

Each component of this distribution is governed by a free variational parameter, 

qr k (n k ) = Beta(Tfci,Tfc 2 ) Q^ k ( z ik) = Bernoulli (V i/c ) 

% k3 ( d kj) = N(0kj, $kj) <7a(t) = Gamma(Ai, A 2 ) 

qe ik (s ik ) = N(0ik, Bifc) 5c («) = Gamma(ei, e 2 ) 

We optimize these parameters with respect to a bound on the marginal probability of the 
observations. This bound is equivalent, up to a constant, to the negative KL divergence between 
q and the true posterior. Thus maximizing the bound is equivalent to minimizing KL divergence 
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to the true posterior. Let S = {c ,%,c, rf,e,/} be the hyper-parameters. The variational lower 
bound is 

log(p(X( c )|H)) > H(q) + Y! < k=1 {E q [logp(7T fc |co,?7o,^)] 

+ Sill EqpOgptefclTT)] + Y> J j=1 E q[ lo § (p( d kj\/3 kj ))] 

+ 2^ E q [log (p(s ik \a)p(a\e, /))]} (3) 

+ H=i {E q [logp(xf|Z,S,D( c ) 7 )] + E q [logp( 7 |c, d)]}, 

where H(g) is the entropy of the variational distribution and dimensionality of the dictionary 
elements J is twice as big as the single-scale model. We denote this function C(q). 

Holding the other parameters fixed, we can optimize each variational parameter exactly; this 
gives an algorithm that goes uphill in C(q) Il39l . (Further, this will provide the algorithmic 
components needed for the online algorithm of Section |III-B[ ) 



Update equations for each free parameter optimizing this bound are given below. In all 

i(-k) 



(c) 

equations, I P represents P x P identity matrix, and x\_ k) represents the reconstruction error 



using all but the kth dictionary element, that is 

t{- k ) = x ! c) - ° (c) (s* © + 4 C) (^ © *k). 

The expectation based on the variational distribution is then given by 

k=l 

Update for the binary factor assignment z ik : The variational parameter for factor assignment 
%ik i s Vik. We first consider two values of the variational distribution for two values (0,1) of z ik , 

q(z ik = 1) cc exp(Eg[ln(7r fc )])exp^ 

q(z ik = 0) cc exp(E,[/n(l - 7r fc )]), where 

E,pn(7r fc )] = ^(c ?7o + Yj Vik ) ~ ^( c ° + N ) 

i 

E,pn(l-7r fc )] = V(c (l-%)-2]^ + iV )-^( c o + iV ) 
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Then the update equation for the variational parameter v ik is given as 



q(zik = i| 


-) 


q{zik = i 


-) + q(zik = 01 1 


-) 



Update for the shared sparse weight s ik : The variational distribution for the sparse weight 
s ik is Gaussian parametrized with mean ik and variance Q ik . Coordinate ascent update equation 
for these free variational parameters are 

Oik = ^e^^0i c)T E g [x ( ^ } ]. 

Update for the kth coupled dictionary element dj^: The variational distribution for the 
couple dictionary element dj^ is Gaussian parametrized with mean <f$ and variance & k \ 
Coordinate ascent update equation for these free variational parameters are 



k 



(c) - (tPiw + ^tfa + eM)' 



X2 i=i 



i=i 



The updates for high resolution (h) and low resolution (1) components can be given separately 
as 



2 i=l 2 i=l 

, AT x AT 



Vfc = l_i dikPikX i{-k) = Y k 2j 0ikUikX i(-kV 

^ 2=1 ^ 2=1 

Update for the dictionary usage probabilities n k : The variational distribution for the 
dictionary usage probabilities n k is a beta distribution parametrized with the shape parameters 
CTteij Tfe2)- Coordinate ascent update equation for these free variational parameters are 

N 

Tkl = C T]o + ^ V ik 
i=l 
N 

r k 2 = N -^is ik + c (l-r) ). 
i=i 
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Update for the precision 7: The variational distribution for the precision 7 of the observation 
noise e { is a gamma distribution parametrized with (Ai, A 2 ). Coordinate ascent update equation 
for these free variational parameters are 



A x = c + NP 

( N ... K K 

A 2 = d + 



, N K K 

2 2 {n*i c) - S + S + ^)(<rf c)T <4 c) + 



2 

2 = 1 fc=l fc=l j 

k=i 

Update for the precision a: The variational distribution for the precision a of the sparse 
weights s ik is a gamma distribution parametrized with (ei, e 2 ). Coordinate ascent update equation 
for these free variational parameters are 

e 1 = e+^NK 



, N K 



--1 k=i 



Algorithm 1 Batch VB 



Sample TV observations from the data. Initialize r, z/, 0, 0, 0, A, e using Gibbs sampler, 
for t = 1 to T do 

Init. local variables v nh , nfc , nfc using Gibbs sampler, 
while relative improvement in i is large do 
for k = 1 to K do 
for n = 1 to TV do 

update z/ n £, n &, n /e by using batch VB updates, 
compute t&, A, e by batch VB updates. 



5. Online Variational Inference 

We now develop online variational inference. We divide the variational parameters into global 
variables and local variables. Global variables depend on all of the images. These are the 
dictionary probabilities ir k , dictionary elements d k , precisions ol and 7. Local variables are 
the ones drawn for each image. These are the weights s^, binary variables z*. The algorithm 
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iterates between optimizing the local variables using local (per-image) coordinate ascent, and 
optimizing the global variables. This same structure is found in many Bayesian nonparametric 
models E5I, 1140). 

The basic idea is to optimize Equation [3] via stochastic optimization ll4T1l . This means we 
repeatedly follow noisy estimates of the gradient with decreasing step sizes p t . If the step sizes 
satisfy Yjt Pt = 00 an d Yjt Pt < 00 then we will converge to the optimum of the objective. (In 
variational inference, we will converge to a local optimum.) 

Algorithm 2 Online VB with mini-batches 

Define p t = (r + t)~ K , Initialize r, 0, <I>, 0, 0, A, e using Gibbs sampler. 

for t = 1 to ^ do 

Sample Ns new observations from the data. Initialize local variables v nk , 6 Uk , nfc using Gibbs sampler, 
while relative improvement in £ is large do 
for k = 1 to K do 

for n t = (t - 1) x N s + 1 to t x N s do 

update ^ nt fc, nt /c 7 ®n t fc by using batch VB updates, 
compute ^fe, 4> k , Tfc, A, e by batch VB updates as if there are N /Ns copies of the images, 
for k = 1 to K do 

update ^ki4>k^ T ki\^ by Equation [6] 



The noisy estimates of the gradient are obtained from subsampled data. We write the objective 
C as a sum over data points. Defining the distribution g(n) which uniformly samples from the 
data, we can then write C as an expectation under this distribution, 



The gradient of the objective can be written as a similar expectation. Thus, sampling data at 
random and computing the gradient of £ n gives a noisy estimate of the gradient. 

There are two further simplifications. First, when we subsample the data we optimize the 
local variational parameters fully and compute the gradient of £ n with respect to only the global 
variational parameters. Second, we use the natural gradient Il42l rather than the gradient. In mean 
field variational inference, this simplifies the gradient step as follows. Suppose we have sampled 
an image n and fitted its local variational parameters given the current settings of the global 




(4) 



NE g [£(r, i/ nj 0, n , n , A nj e 



Xn)] 



(5) 
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Fig. 3. Dictionary trained in batch mode on lluminance channel with SR ratio = 2. (Left) HR Dictionary, (Right) LR 
Dictionary, Every square represents a dictionary element and the HR-LR pairs are co-located. HR dictionary consists of sharper 
edges. 



variational parameters. Let r, 0, <E>, A, e be the global variational updates from Section |III-A| as 
though we observed TV copies of that image. (Note that these depend on its local variational 
parameters.) Following a noisy estimate of the natural gradient of C is equivalent to taking a 
weighted average of the current and the newly fitted global parameters, e.g. 

= (l-p t )0 + p,0. (6) 

It follows that there is no additional computational cost to optimizing the global variational 
parameters with stochastic optimization versus coordinate ascent. 

In our implementation, we decrease the step-size p t by p t = (p + t)~ K . The learning rate 
parameter p down- weights early iterations; the parameter k controls the speed of forgetting 
previous values of the global variables. 

The full online VB algorithm is listed in Algorithm [2j (Note that we sample the data in 
mini-batches, rather than one at a time. When the mini-batch size is equal to one data point, we 
recover the algorithm as described above.) 
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C. Initialization with MCMC. 

We initialize both the batch and online VB with a few iterations (e.g. 5 or 10) of MCMCQThis 
is useful for two reasons: (1) It provides a good initialization and thus faster convergence, (2) 
Noisy random- walks of MCMC help VB avoid low-quality local optima: at the beginning of each 
e-step, MCMC initializes and by sampling from their approximate posterior distribution, 
given the most recent global variables. These samples are noisy estimates of the sparse weights 
near their posterior means. For instance, when the factor assignment z ik equals 0, the MCMC 
draws the sparse weight s ik from the prior 7V(0, 1/a) whereas in VB it would be exactly 0. 
Providing the freedom to "jiggle" gives the algorithm the opportunity, similar to simulated 
annealing, to jump away from one local optimum to reach a better optimum. 

IV. Experiments 

We use three data sets. To train, we use the set of 68 images collected from the web by lt2TTl . 
We test on the Berkeley natural image data set (20 100x100 images) and a benchmark set of 
images (11 images of various size) used by the community to evaluate SR algorithms^] These 
data sets provide us with a rich set of HR-LR pairs. 

Throughout this work, unless otherwise mentioned we use the same parameters (without any 
tuning): we set the SR ratio to 2 or 4 and the patch size to 8 x 80 The hyper-parameters are 
c = d = e = f = 10 -6 and c = 2, rj = 0.5, these are standard uninformative priors used in e.g. 
EH). The truncation level K in BP is set to 512. Most images use fewer factors, e.g. Baboon 
uses 487, House 438 and Barbara 471 factors. We apply all algorithms only to the illuminance 
channel and use Bicubic interpolation for the color layers (Cb, Cr) for all compared methods. 

We study our methods with two kinds of posterior inference — Gibbs sampling (BP) and online 

! For batch VB, these MCMC samples are collected on the same subset of the data on which batch VB will process. For 
online VB, they are collected from the mini-batches. In both cases, scale problem of MCMC is not an issue since we only 
collect few samples (e.g. 5 or 10). As we mentioned before, scale is a problem for MCMC since it needs to go over the data 
many times for convergence (e.g. thousands of iterations). Time scaling is discussed in more detail in Section IV-C 

2 We are using SR ratio=2 or 4. For SR ratio 2, the images which do not have even number of rows/columns are cut to have 
even number of rows/column to prevent any possible mismatch and error in computing PSNR in all algorithms. For instance 
the last column of pixels from an image of size 330 x 171 is excluded so the corresponding image have the size 330 x 170. 

3 The visual results for SR ratio 4 are in the appendix G. 
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TABLE I 

Test Results with SR ratio = 2. PSNR for the illuminance channel is presented (the higher the better). 
BP: Proposed algorithm trained via Gibbs sampler, O-BP Proposed algorithm trained via Online VB, 

SEEING MORE DATA, ScSR: SUPER-RESOLUTION VIA SPARSE REPRESENTATION [[211 . NNI: NEAREST NEIGHBOR 
INTERPOLATION, SME: SPARSE MIXING ESTIMATION |43l . 



PSNR 


Bic. 


NNI 


Bil. 


SME 


ScSR256 


ScSR512 


BP 


OBP 


Baboon 


23.63 


23.12 


23.05 


23.10 


24.33 


24.36 


24.27 


24.39 


Barbara 


25.35 


25.10 


24.92 


24.42 


25.88 


25.89 


25.98 


25.99 


Boat 


29.95 


28.39 


28.94 


29.72 


31.23 


31.29 


31.17 


31.31 


Camera 


30.32 


35.20 


28.94 


26.33 


30.68 


30.46 


31.51 


30.94 


House 


32.79 


30.34 


31.61 


33.28 


34.26 


34.31 


34.08 


34.27 


Peppers 


31.99 


29.88 


31.18 


33.06 


33.05 


33.06 


32.45 


33.08 


Parthen. 


28.12 


27.28 


27.42 


27.28 


29.05 


29.10 


28.96 


29.06 


Girl 


34.76 


33.44 


33.98 


33.98 


35.57 


35.58 


35.62 


35.66 


Flower 


40.04 


37.96 


38.94 


39.72 


41.06 


41.11 


41.26 


41.33 


Lena 


32.83 


31.00 


31.72 


33.57 


34.47 


34.54 


34.56 


34.68 


Raccoon 


30.95 


29.82 


29.95 


31.73 


32.39 


32.43 


32.43 


32.62 



variational inference (O-BP), which scales to larger data setsQ To compare, we study both 
interpolation and example-based algorithms. Bicubic interpolation is the gold standard in the SR 
literature. We also study nearest neighbor interpolation, bilinear interpolation and sparse mixing 
estimation (SME) K3ll . To compare with an example-based method, we use super-resolution via 

4 The software for each algorithm presented and all of the visual results will be publicly available. 
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sparse representation (ScSR, ll2TT0 . Both BP's and ScSR's dictionary learning stages use 
10 5 patches sampled from the training data, however O-BP uses the whole set in online fashion. 
The HR and LR dictionaries trained by our approach are shown in Figure |3j The HR dictionary 
consists of sharper edges. 

As a quantitative measure of performance we compute the signal to noise ratio (PSNR), a 
measure that is widely used in image recovery applications. We present the PSNR results for 
benchmark images in Table [I] and natural images in Table [II} These PSNR based results can be 
summarized as: (1) The online learning algorithm and ScSR performs similarly, (2) They both 
slightly perform better than the Gibbs sampler. (3) All of the example based algorithms perform 
better than the interpolation based techniques. 

A. Evaluation and Crowdsourcing via Mechanical Turk 

Though signal to noise ratio (PSNR), is a widely used metric in image recovery applications, 
this is not enough to measure human judgement. For this purpose, we also performed human 
evaluation experiments on Amazon Mechanical Turk (MTurk, http: //www.mturk.c om). 

The Amazon Mechanical Turk (MTurk) is a web interface for deploying small tasks to 
people, called Turkers. Typically an MTurk experiment works as follows: the requesters, people 
organizing the experiments and paying Turkers, prepare tasks called HITs (Human Intelligence 
Tasks). Each HIT might be a comparison of images, labeling of text etc. Once the HITs are 
completed, requesters can approve or reject the HITs based on their reliability measures (for 

5 We used the code and implementation provided by |2T1 . We also used their training images, in order to have a fair comparison, 
and we did not change any of their parameters (including noise variance). 

6 We provide visual comparisons to 1 15 1, 1 16 1, |20|, |44| in appendix H. |20| provides very sharp edges by artificially enhancing 
them. However, this makes images unrealistic (looking like graphically rendered). Sparse coding allows any single-image SR 
algorithm as a pre-processing step. Instead of bicubic interpolation (see Figure [TJ 1 20 1 might be used with sparse coding to 
boost the sharpness of the edges. 

7 The dependent hierarchical Beta process (dHBP), another bayesian nonparametric prior, is proposed in |29|. It removes the 
exchangeability assumption of beta-Bernoulli construction. This prior assumes that each observation i has a covariate £i e R £ . 
In this model, the closer the two sparse factor assignments z% and Zj in the covariate space, the more likely they share similar 
dictionary elements. It applies dHBP using spatial information as covariates to image inpainting and spiky noise removal, and 
shows significant improvement over BP. We obtained preliminary results with dHBP for super-resolution. However, in this setting 
we did not observe improvement over BP. 
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TABLE II 

Test Results with SR ratio = 2. PSNR for the illuminance channel is presented (the higher the better). 
BP: Proposed algorithm trained via Gibbs sampler, O-BP Proposed algorithm trained via Online VB, 

SEEING MORE DATA, ScSR: SUPER-RESOLUTION VIA SPARSE REPRESENTATION [[211 . NNI: NEAREST NEIGHBOR 

INTERPOLATION. 



PSNR 


Bic. 


NNI 


Bilin. 


ScSR256 


ScSR512 


BP 


O-BP 


Nl 


29.74 


27.44 


28.39 


31.52 


31.55 


31.52 


31.56 


N2 


29.52 


27.71 


28.27 


31.16 


31.20 


31.17 


31.20 


N3 


22.97 


21.95 


22.12 


23.94 


24.00 


23.80 


23.94 


N4 


21.63 


20.98 


20.90 


22.59 


22.66 


22.38 


22.41 


N5 


24.85 


23.85 


24.01 


26.01 


26.06 


25.77 


25.90 


N6 


25.34 


24.61 


24.70 


26.20 


26.26 


26.08 


26.07 


N7 


26.66 


25.43 


25.73 


27.92 


27.92 


27.77 


27.97 


N8 


26.08 


24.71 


25.23 


27.27 


27.43 


27.01 


27.26 


N9 


26.02 


25.29 


25.42 


26.82 


26.89 


26.58 


26.73 


N10 


24.79 


24.07 


23.92 


26.23 


26.25 


25.91 


26.16 


Nil 


26.86 


25.22 


25.97 


28.06 


28.04 


27.99 


28.16 


N12 


28.16 


26.65 


27.07 


29.63 


29.66 


29.78 


29.86 


N13 


25.15 


24.18 


24.22 


26.40 


26.36 


26.31 


26.33 


N14 


26.82 


25.98 


25.92 


27.99 


28.01 


27.86 


27.94 


N15 


25.78 


24.64 


24.81 


27.00 


27.04 


26.90 


27.06 


N16 


27.28 


25.85 


26.16 


28.88 


29.01 


28.83 


28.96 


N17 


27.79 


26.33 


26.81 


29.21 


29.24 


29.02 


29.16 


N18 


29.13 


27.75 


28.18 


30.38 


30.41 


30.25 


30.43 


N19 


24.57 


23.19 


23.50 


26.07 


26.10 


25.92 


26.02 


N20 


22.00 


21.13 


21.05 


23.26 


23.28 


23.26 


23.29 



instance trivial solution HITs, as we explain next, and the time spend on each HIT are frequently 
used measures for reliability). Approved results are acquired to be used in the analysis. 

While preparing HITs, we used the natural image data. We asked Turkers to visually assess 
and select the better of two HR reconstructions of each image. We considered all ordered 
combinations of the algorithms, each equally likely, e.g., BP vs ScSR, BP vs Bicubic etc. 
We initially collected 42, 807 decisions from 208 unique Turkers. For quality control we gave 
test pairs in which a ground truth HR image was used, i.e., a comparison of an algorithmic 
reconstruction vs a true HR image. All of the judgments of the Turkers who failed to pass this 
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Fig. 4. Human Evaluation via Mechanical Turk. (Top) Average win rate in one-to-one comparisons. (Bottom) Win rates 
for each one-to-one comparison. Each number represents the winning rate of the method in the column, e.g., 0.57 for BP vs 
ScSR (BP is on the column and ScSR on the row) means that on average, 0.57 of the times Turkers voted in favor of BP. 



test (Turkers who selected the algorithmic reconstruction instead of true HR) were removed. 
This reduced the data to 20,469 decisions from 161 unique reliable Turkers. 

The results of the human evaluation are in Figure [4} In the bottom table, win rates for each 
one-to-one comparisons are provided. Each number represents the winning rate of the method in 
the column. For instance, 0.93 for O-BP vs Nearest (O-BP is on the column and Nearest on the 
row) means that out of 100 binary comparisons of O-BP and Nearest, 93 of the times Turkers 
voted in favor of O-BP. In general, we observe that example-based methods perform significantly 
better than interpolation-based methods. Within the example-based approaches, the models are 
similar. However, our approach does not use the first and second-order derivative filters for the 
LR patches used by ScSR as features, yet we perform similarly; moreover we do not need to set 
the noise precision and the number of dictionary elements, both required parameters of ScSR 
(We used the parameters provided by lEHTl in ScSR.). 
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(a) High (b) Low (c) Bicubic (d) NNI (e) Bilinear 




(f) ScSR (g) BP (h) O-BP 



Fig. 5. Reconstruction of Natural Image 3. BP: Algorithm presented in this work trained via Gibbs sampler, O-BP Algorithm 
presented in this work trained via Online VB, ScSR: Super-Resolution via Sparse Representation. Example based approaches 
are superior to interpolation techniques, ScSR and our approach perform similarly. 

In the PSNR results, ScSR and O-BP seem to perform similarly and both slightly better than 
BP. However, in the human evaluation we observed that BP reconstructions are found to be 
better. (Based on 95% confidence intervals, both the BP vs O-BP and BP vs ScSR results are 
statistically significant. The O-BP vs ScSR difference is statistically insignificant.) This shows 
that PSNR is not necessarily consistent with the human assessment of images H51 . Sample 
visual results are shown in Figures [5j [6] and [7} (The remaining results are in the appendix E and 
F.) 
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(h) BP (i) O-BP 



Fig. 6. Reconstruction of Parthenon Image. BP: Algorithm presented in this work trained via Gibbs sampler, O-BP 
Algorithm presented in this work trained via Online VB, ScSR: Super-Resolution via Sparse Representation. SME: Sparse 
Mixing Estimation l43l 
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(a) High (b) Low (c) Bicubic 




(g) ScSR (h) BP (i) O-BP 



Fig. 7. Reconstruction of Baboon Image. BP: Algorithm presented in this work trained via Gibbs sampler, O-BP Algorithm 
presented in this work trained via Online VB, ScSR: Super-Resolution via Sparse Representation. SME: Sparse Mixing 
Estimation |43|. 
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Fig. 8. Learning the number of dictionary elements from the data. (Left) PSNR of the reconstruction of the Barbara image 
by nonparametric BP and parametric ScSR with different number of dictionary elements. (Right) Histogram of the number of 
dictionary elements for BP when K = 1024 over 100 samples. 



B. Nonparametric property of the model. 

In this section, we demonstrate the importance of a Bayesian nonparametric method for image 
super-resolution. As we mentioned in Section [D^Aj we use a beta-Bernoulli process (BP) for the 
factor assignments that encodes which dictionary elements are activated for the corresponding 
observation. In the binary matrix (whose rows are the factor assignment z/s), the columns with 
at least one active cell correspond to factors that are used. 

The distinguishing characteristic of this prior is that the number of the factors to be learned 
is not specified a priori. Conditioned on the data, we examine the posterior distribution of the 
binary matrix to obtain a data-dependent distribution of how many components are needed. For 
the parametric ScSR, the number of dictionary elements must be set a priori. This is illustrated 
by the following experiment. For both model, we train on 10 4 patches, for different values of 
K (starting from scratch each time); for ScSR, K is the target number (which needs to be set 
before starting the algorithm), while for our approach, K functions as an upper bound on the 
number of dictionary elements (which should not be too low). Figure [8] shows that, unlike ScSR, 
our approach is less sensitive to the value of K if it is sufficiently large. The Barbara image uses 
700, 801 and 816 factors in our approach for K equals to 1024, 2048 and 4096 respectively. 
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Fig. 9. Held-out prediction performances of Online Learning with different mini-batch sizes. Online- VB run on the whole 
data set is compared with the Batch- VB run on a subset of the data. The online algorithms converge much faster than the batch 
algorithm does. 



C. Online learning, Computational Time and Scaling 

In this section, we compare the scale properties of the algorithms presented in this paper. In 
online learning, instead of subsampling the patches during the dictionary learning stage, we use 
the full data set and process the data segment by segment (so called "mini-batches"). We use 



the training data set of Section [TV] The learning parameters are set to n = 0.501 and p = 3. 

Figure [9] shows the evolution of the mean PSNR on the held-out natural image data set 
by the online and the batch algorithms as a function of the number of image patches seen 
(visualizations of the learned dictionaries are provided in appendix D). The number of patches 
seen represents the computational time since both algorithms' time complexity is linear with 
number of observations. For online VB, the number of patches seen represents the total number 
of data seen after each iteration. For batch VB, this represents cumulative sum of the number 
of same data seen after each variational-EM iteration. Even before the second iteration of the 
batch VB (100K) is completed, online VB with 5K mini-batch converges - reaches to a local 
optima better than batch VB. This means that the online algorithm finds dictionaries at least as 
good as those found by the batch VB in only a fraction of the time. As also shown in Table 
[1} it finds high quality dictionaries. This may be because stochastic gradient is robust to local 
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optima [46] . 

For dictionary training, the convergence time for online VB with 5K mini-batch size is 16 
hours. In Gibbs sampling, we throw away the first 1500 samples for the burn-in period and 
later collect 1500 samples to approximate the posterior distributions. This takes approximately 
50 hours on the same machine with an unoptimized Matlab implementation on 10 5 number 
of patches. Running Gibbs sampling same amount of time with online VB, i.e. collecting less 
number of samples such as 500, reduces held-out PSNR between 0.2 dB to 0.5 dB, depending 
on the image. This is consistent with the findings in ESI . 

V. Discussion 

We developed a new model for super-resolution based on Bayesian nonparametric factor 
analysis, and new algorithms based on Gibbs sampling and online variational inference. With 
online training, our algorithm scales to very large data sets. We evaluated our method against 
a leading sparse coding technique ETTl and other state of the art methods. We evaluated both 
with traditional PSNR and by devising a large scale human evaluation. This is a new real-world 
application that can utilize online variational methods. 

The choice of the inference algorithm depends on the usage case. Our results suggest that with 
more computation time Gibbs sampling performs slightly better (based on human evaluation). 
If speed is important, our online algorithms can be used without much loss. 

Regarding the evaluation metric, the standard in image analysis has been signal-to-noise ratio 
(PSNR). However, its practical relevance has been questioned B51l . The human eye is sensitive to 
details which are not always captured in this metric, and that is why we ran a human evaluation. 
Our experiments show that the signal-to-noise ratio is not necessarily consistent with human 
judgement. 

As future work, our approach can be used as a building block in other, more complicated, 
probabilistic models. For example, our approach could be developed into a time series to perform 
super-resolution on video or a hierarchical model can be built that fully generates the whole image 
instead of patch based approach. 
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(b) 
Low 
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Fig. 10. (Natural Image 18) Test Set Results with SR ratio = 4. 
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Fig. 11. (Natural Image 19) Test Set Results with SR ratio = 4. 
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